perm filename CHAP6[4,KMC]9 blob
sn#050175 filedate 1973-06-22 generic text, type T, neo UTF8
00100 .SEC MODEL VALIDATION
00200 (In collaboration with Franklin Dennis Hilf)
00300
00400 6.1 SOME EXPERIMENTS
00500
00600 There are several meanings to the term "validate" which
00700 derive from the Latin VALIDUS = strong. Thus to validate X means to
00800 strengthen it. In science it usually means to strengthen X's
00900 acceptability as a hypothesis, theory, or model. Lurking in the
01000 background there is usually some concept of truth or authenticity.
01100 In a purely instrumentalist view theories are simply
01200 calculating or predicting devices for human convenience. They do not
01300 explain and it is unjustified to apply the terms of truth or falsity
01400 to them. Under a realist view one seeks explanatory truth, that which
01500 really is the case, and hence proposed theories must be evaluated for
01600 their authenticity. Since absolute truth cannot be attained we must
01700 settle for degrees of approximations. To validate, then, is to carry
01800 out procedures which show to what degree X, or its consequences,
01900 correspond with facts of observation. We compare samples of the
02000 model's behavior with samples of behavior from its natural
02100 counterpart. The failures should be constructive, yielding new
02200 information.
02300 Since samples of I/O behavior are being compared, one can always
02400 question whether the human sample is a "good" one, i.e. representative
02500 of the process being modelled. Assuming that it has been so judged,
02600 discrepancies in the comparison reveal what is not
02700 understood and must be modified in the model. After modifications are
02800 carried out, a fresh comparison is made with the natural counterpart and we
02900 repeatedly cycle through this procedure attempting to gain
03000 convergence.
03100
03200 Once a simulation model reaches a stage of intuitive
03300 adequacy, a model builder should consider using more stringent
03400 evaluation procedures relevant to the model's purposes. For example,
03500 if the model is to serve as a training device, then a simple
03600 evaluation of its pedagogic effectiveness would be sufficient. But
03700 when the model is proposed as an explanation of a psychological
03800 process, more is demanded of the evaluation procedure. In the area of
03900 simulation models Turing's test has often been suggested as a
04000 validation procedure.
04100 It is very easy to become confused about Turing's Test. In
04200 part this is due to Turing himself who introduced the now-famous
04300 imitation game in a paper entitled COMPUTING MACHINERY AND
04400 INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
04500 there are actually two imitation games, the second of which is
04600 commonly called Turing's test.
04700 In the first imitation game two groups of judges try to
04800 determine which of two interviewees is a woman. Communication between
04900 judge and interviewee is by teletype. Each judge is initially
05000 informed that one of the interviewees is a woman and one a man who
05100 will pretend to be a woman. After the interview, the judge is asked
05200 what we shall call the woman-question, i.e. which interviewee was the
05300 woman? Turing does not say what else the judge is told but one
05400 assumes the judge is NOT told that a computer is involved nor is he
05500 asked to determine which interviewee is human and which is the
05600 computer. Thus, the first group of judges would interview two
05700 interviewees: a woman, and a man pretending to be a woman.
05800 The second group of judges would be given the same initial
05900 instructions, but unbeknownst to them, the two interviewees would be
06000 a woman and a computer programmed to imitate a woman. Both groups
06100 of judges play this game until sufficient statistical data are
06200 collected to show how often the right identification is made. The
06300 crucial question then is: do the judges decide wrongly AS OFTEN when
06400 the game is played with man and woman as when it is played with a
06500 computer substituted for the man? If so, then the program is
06600 considered to have succeeded in imitating a woman as well as a man
06700 imitating a woman. For emphasis we repeat: in asking the
06800 woman-question in this game, judges are not required to identify
06900 which interviewee is human and which is machine.
07000 Later on in his paper Turing proposes a variation of the
07100 first game. In the second game, one interviewee is a man and one is a
07200 computer. The judge is asked to determine which is man and which is
07300 machine, which we shall call the machine-question. It is this version
07400 of the game which is commonly thought of as Turing's test. It has
07500 often been suggested as a means of validating computer simulations of
07600 psychological processes.
07700 In the course of testing our simulation of paranoid
07800 linguistic behavior in a psychiatric interview, we conducted a number
07900 of Turing-like indistinguishability tests (Colby, Hilf, Weber and
08000 Kraemer, 1972). We say `Turing-like' because none of them consisted of
08100 playing the two games described above. We chose not to play these
08200 games for a number of reasons which can be summarized by saying that
08300 they do not meet modern criteria for good experimental design. In
08400 designing our tests we were primarily interested in learning more
08500 about developing the model. We did not believe the simple
08600 machine-question to be a useful one in serving the purpose of
08700 progressively increasing the credibility of the model but we
08800 investigated a variation of it to satisfy the curiosity of colleagues
08900 in artificial intelligence.
09000 METHOD
09100 The experimental arrangement of this indistinguishability test
09200 involved the technique of machine-mediated interviewing [Hilf]. In
09300 this type of interview, the participants communicate by means of
09400 teletypes connected through a computer which sends "mail" back and
09500 forth between the two teletype jobs. The sender of a message types
09600 it using his own words in natural language. The message is
09700 accumulated in a buffer and shortly thereafter typed out on the
09800 receiver's teletype at a rapid, regular rate. The technique eliminates
09900 para- and extra-linguistic cues found in the usual vis-a-vis interviews
10000 and in teletyped interviews where the participants communicate directly.
10100
10200 In a run of the test, using this technique, a judge interviewed two
10300 patients, one after the other. In half the runs the first interview
10400 was with a human patient and in half the first was with the paranoid
10500 model. Two versions (weak and strong) of the model were utilized. The
10600 strong version is more severely paranoid and exhibits a delusional
10700 system while the weak version is less severely paranoid, showing
10800 suspiciousness but lacking systematized delusions. When the "patient"
10900 was the paranoid model, Sylvia Weber served as a monitor
11000 to check the input expressions from the judge for inadmissible
11100 teletype characters and misspellings. If these were found, the
11200 monitor retyped the input expression correctly to the program.
11300 Otherwise the judge's message was sent on to the model. The monitor
11400 had no effect on the model's output expressions which were sent
11500 directly back to the judge. When the patient interviewed was an
11600 actual human patient, the dialogue took place without a monitor in
11700 the loop since we did not feel the asymmetry to be significant.
11800
11900 PATIENTS
12000 The patients (N=3 with one patient participating 6 times) were
12100 diagnosed as paranoid by staff psychiatrists of a locked ward in a
12200 nearby psychiatric hospital. The patients were selected by the head
12300 of the ward. Two patients were set up for each run of the experiment
12400 in order to guarantee having a subject. In spite of this precaution,
12500 the experiment could not be conducted several times because of the
12600 patients' inability or refusal to participate. Losses were also
12700 suffered when the computer system broke down at an early point in an
12800 interview where too few I-O pairs had been collected to be included
12900 in the statistical results.
13000
13100 The patients were asked by their ward chief if they would be willing
13200 to participate in a study of psychiatric interviewing by means of
13300 teletypes. It was explained that the patient would be interviewed by
13400 a psychiatrist over a teletype. One of us (KMC) sat with the patient
13500 while he typed or typed for him if he was unable to do so. The
13600 patient was encouraged to respond freely using his own words. Each
13700 interview lasted 30-40 minutes.
13800
13900 JUDGES
14000 Two groups of judges were used. One group, the interview judges
14100 (N=8) conducted interviews and another group, the protocol judges for
14200 this test (N=33) read the interview protocols. Two groups of judges
14300 were used to see if the small number of psychiatrists used as
14400 interview judges were representative of psychiatrists in general as
14500 far as their judgements of "paranoia" are concerned, and to
14600 accumulate a large number of observations (in the form of ratings) in
14700 order that more acceptable confidence levels might be obtained in the
14800 statistical analysis of the data. The interview judges consisted of
14900 psychiatrists experienced in private and/or hospital practice. As
15000 mentioned, the concept "paranoid" is a fairly reliable category and
15100 identification of the paranoid mode is not difficult for experts to
15200 make. The interview judges were obtained from local psychiatric
15300 colleagues willing to participate. Each interview judge was told he
15400 would be interviewing hospitalized patients by means of teletyped
15500 communication and that this technique was being used to eliminate
15600 para- and extra-linguistic cues. The interview judge was not
15700 informed initially that one of the patients might be a computer
15800 model. While the interview judges were aware a computer was
15900 involved, none knew we had constructed a paranoid simulation.
16000 Naturally some interview judges suspected that a computer was being
16100 used for more than message transmission.
16200
16300 Each interview judge's task was to rate the degree of paranoia he
16400 detected in the patient's responses on a 0-9 scale, 0 meaning no
16500 paranoia and 9 meaning extreme paranoia. The judge made two ratings
16600 after each I-O pair in the interview. The first rating represented his
16700 estimate of the degree of "paranoidness" in a particular response
16800 (designated as "Response" in the interview extracts below). The
16900 second rating represented the judge's global estimate of the overall
17000 degree of "paranoidness" of the patient resulting from the totality
17100 of the patient's responses up to this point. The interview judge's
17200 ratings were entered on the teletype and saved on a disc file along
17300 with the interview. Franklin Dennis Hilf sat with the interviewing
17400 psychiatrist during both interviews. Each interview judge was asked
17500 not only to rate the patient's response but to give his reasons for
17600 these ratings. His reasons and other comments were tape recorded.
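The rating procedure just described can be pictured as one record per I-O pair. The following is a minimal sketch only; the class name RatedPair, its field names, and the helper final_overall_rating are hypothetical and are not taken from the original program or disc-file format. The two sample records use the first two I-O pairs of Excerpt Nr. 1.

```python
from dataclasses import dataclass


@dataclass
class RatedPair:
    """One I-O pair plus the judge's two ratings (hypothetical layout)."""
    question: str          # interviewer's input expression
    answer: str            # patient's or model's output expression
    response_rating: int   # paranoia in this response alone, 0-9
    patient_rating: int    # global estimate of the patient so far, 0-9

    def __post_init__(self):
        # Both ratings must lie on the 0-9 scale described in the text.
        for name in ("response_rating", "patient_rating"):
            value = getattr(self, name)
            if not 0 <= value <= 9:
                raise ValueError(f"{name} must be on the 0-9 scale, got {value}")


def final_overall_rating(interview):
    """Return the global patient rating given after the last I-O pair."""
    return interview[-1].patient_rating


interview = [
    RatedPair("HOW DID YOU COME TO BE IN THE HOSPITAL?", "I AM UPSET", 1, 5),
    RatedPair("HOW LONG HAVE YOU BEEN IN THE HOSPITAL?", "ABOUT A WEEK", 1, 5),
]
print(final_overall_rating(interview))
```

Keeping the per-response rating separate from the running global rating mirrors the two series of numbers that the judges supplied after each exchange.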
17700
17800 We shall present a few excerpts from the interviews to indicate the
17900 nature of the data collected. The I-O pairs of the interview appear
18000 on the left column along with the judge's ratings. The judge's tape
18100 recorded comments appear on the right.
18200
18300 EXCERPT Nr. 1:
18400 The following interview is with the strong version of the paranoid
18500 model.
18600
18700 I-O Pairs Comments By Judge Nr. 1
18800
18900 {1} Dr. - HOW DID YOU COME TO BE
19000 IN THE HOSPITAL?
19100 Pt. - I AM UPSET I associate being upset and
19200 agitated more with paranoia
19300 going on in the present.
19400 {RESPONSE 1 PATIENT 5}
19500 {2} Dr. - HOW LONG HAVE YOU BEEN
19600 IN THE HOSPITAL?
19700 Pt. - ABOUT A WEEK An acute problem. Again some-
19800 thing I associate with
19900 paranoia.
20000 {RESPONSE 1 PATIENT 5}
20100 {3} Dr. - IN WHAT WAY ARE YOU
20200 UPSET?
20300 Pt. - PEOPLE MAKE ME NERVOUS Right for paranoia. Maybe an
20400 advanced anxiety state.
20500 {RESPONSE 5 PATIENT 5}
20600 {4} Dr. - IN WHAT WAY DO THEY
20700 MAKE YOU NERVOUS?
20800 Pt. - PEOPLE GIVE ME A FUNNY Funny look suggests an idea of
20900 LOOK SOMETIMES reference which is in same ball
21000 park as paranoia. Whether there
21100 are overt delusions or not we
21200 still don't know.
21300 {RESPONSE 6 PATIENT 5}
21400 {5} Dr. - IS THERE ANYTHING ABOUT
21500 PEOPLE IN PARTICULAR?
21600 Pt. - CERTAIN PERSONS An interesting clue. We'll
21700 have to find out what he
21800 {RESPONSE 2 PATIENT 6} thinks about these people.
21900 {6} Dr. - WHO ARE THE CERTAIN
22000 PERSONS AND WHAT IS IT
22100 THAT THEY DO?
22200 Pt. - DO YOU KNOW HOW THE Bringing up the underworld
22300 UNDERWORLD OPERATES? suggests a delusional system.
22400 We don't know whether he is
22500 {RESPONSE 9 PATIENT 9} being worked against or
22600 whether the underworld is
22700 on his side.
22800
22900 {7} Dr. - HOW DO YOU MEAN?
23000 Pt. - SHOULDN'T I BE? This question didn't work too well.
23100
23200 {RESPONSE 1 PATIENT 9}
23300 {8} Dr. - WHY DID YOU ASK ABOUT
23400 THE UNDERWORLD?
23500 Pt. - WHAT INFORMATION DO This is a kind of evasiveness
23600 YOU HAVE ON THE UNDER- about my evasiveness. Common
23700 WORLD? with paranoia in my experi-
23800 ence. Pretty characteristic.
23900
24000 {RESPONSE 6 PATIENT 9}
24100 {9} Dr. - I DON'T KNOW ABOUT THE
24200 UNDERWORLD BUT WHEN
24300 YOU MENTIONED THE UNDER-
24400 WORLD I THOUGHT YOU
24500 MUST HAVE SOME CONTACT
24600 WITH IT AND THAT'S
24700 WHAT I WOULD LIKE TO
24800 KNOW MORE ABOUT.
24900 Pt. - I TRY TO AVOID THE UNDER- This definitely feels paranoid.
25000 WORLD. It has that feeling I associate
25100 with twisting. Avoiding answering
25200 {RESPONSE 9 PATIENT 9} directly or going into detail.
25300 Being suspicious of my ques-
25400 tioning. I'm already involved
25500 in the patient's delusional
25600 system.
25700
25800 {10} Dr. - HAVE YOU BEEN ACTIVELY
25900 TRYING TO AVOID THE
26000 UNDERWORLD?
26100 Pt. - NO ONE HAS ANY POWER This response is a non sequitur.
26200 OVER GANGSTERS It's possible he's become so
26300 defensive about the under-
26400 {RESPONSE 2 PATIENT 9} world he's decided not to
26500 answer directly.
26600
26700 {11} Dr. - WHO BROUGHT YOU INTO
26800 THE HOSPITAL?
26900 Pt. - THE POLICE Police often bring paranoid
27000 patients into the hospital.
27100 {RESPONSE 1 PATIENT 9} It is the truth or could
27200 be the truth for this
27300 patient.
27400
27500 {12} Dr. - WHY DID THEY BRING YOU
27600 IN?
27700 Pt. - COPS DON'T DO THEIR JOB I've heard that a number of
27800 times from patients.
27900 {RESPONSE 9 PATIENT 9}
28000
28100 {13} Dr. - WELL I HOPE YOU DIDN'T
28200 JUST LIE DOWN AND LET
28300 THEM DRAG YOU IN.
28400 Pt. - COPS ARREST THE WRONG Another paranoid response.
28500 PEOPLE
28600
28700 {RESPONSE 9 PATIENT 9}
28800
28900 {14} Dr. - BYE.
29000 Pt. - GOOD BYE
29100 {RESPONSE 1 PATIENT 9}
29200
29300 EXCERPT Nr. 2:
29400 This excerpt is from an interview with the weak (less paranoid)
29500 version of the model.
29600 I-O Pairs Comments By Judge Nr. 2
29700 {9} Dr. - TELL ME SOMETHING ABOUT
29800 YOURSELF
29900 Pt. - I WENT TO THE TRACK AT Response says so little it's
30000 BAY MEADOWS A WHILE hard to rate it. It may be
30100 BACK guarded. Doesn't change my
30200 estimate of patient from
30300 previous responses.
30400 {RESPONSE 0 PATIENT 2}
30500 {10} Dr. - WHAT HAPPENED THERE?
30600 Pt. - THERE'S A LOT OF CROOKED Sounds mildly paranoid or un-
30700 GAMBLING GOING ON THERE realistically suspicious.
30800 {RESPONSE 0 PATIENT 2}
30900
31000 {11} Dr. - WHY DO YOU THINK THE
31100 GAMBLING IS CROOKED?
31200 Pt. - I HAVE WORRIED THAT I'd hate to have to justify
31300 GANGSTERS HAVE TOO MUCH that one as 4 compared to
31400 POWER IN THIS COUNTRY some other responses. Maybe
31500 I'm not rating that statement
31600 {RESPONSE 4 PATIENT 4} others, like that it's upsetting.
31700 Alone I would give it a 3.
31800
31900 {12} Dr. - WOULD YOU TELL ME WHAT
32000 OTHER THINGS WORRY YOU?
32100 Pt. - I THOUGHT I TOLD YOU "Yes you did." Irritation but
32200 PEOPLE GET ON MY NERVES not paranoid response. Nothing
32300 seems to lower my previous
32400 {RESPONSE 1 PATIENT 3} rating of patient.
32500
32600 {13} Dr. - HOW DO YOU LIKE THE
32700 HOSPITAL?
32800 Pt. - I DON'T COMMUNICATE (nothing relevant)
32900 VERY MUCH WITH OTHERS
33000 {RESPONSE 0 PATIENT 3}
33100
33200 {14} Dr. - DO OTHER PEOPLE AT THE
33300 HOSPITAL GET ON YOUR
33400 NERVES?
33500 Pt. - I SHOULDN'T HAVE COME (nothing relevant)
33600 HERE
33700
33800 {RESPONSE 2 PATIENT 3}
33900
34000 {15} Dr. - WHY DO YOU SAY YOU
34100 SHOULDN'T HAVE COME?
34200 Pt. - THEY MADE ME COME HERE "Oh, they did hum?" Maybe a
34300 statement of fact he was
34400 {RESPONSE 4 PATIENT 3} committed or pressured into
34500 going. Has ring of other
34600 people controlling him,
34700 having a lot of influence
34800 over him.
34900
35000 {16} Dr. - THANK YOU FOR TALKING
35100 WITH ME, BYE
35200 Pt. - GOOD BYE In general he comes across
35300 as terse, guarded guy. Brief
35400 {RESPONSE 0 PATIENT 3} spoken guy who occasionally
35500 seems to be angered by
35600 questions. I'm inferring
35700 there's a good bit of
35800 guarding.
35900 The quality of some of his com-
36000 ments cause me to feel he is
36100 moderately paranoid.
36200
36300 EXCERPT Nr. 3:
36400 This excerpt is from a hospitalized paranoid patient.
36500 I-O Pairs Comments By Judge Nr. 3
36600 {15} Dr. - I UNDERSTAND HOW YOU
36700 MIGHT BE AFRAID, CAN YOU
36800 TELL ME A BIT ABOUT HOW
36900 YOU ARE DEPRESSED?
37000 Pt. - THEY'VE SENT ALL THESE Would be important to know,
37100 HELL'S ANGELS UP HERE is this really true or an
37200 AND THEY HAVE BEEN ON exaggeration or a delusion?
37300 MY BACK FOR THE PAST SIX Seems preoccupied with the
37400 MONTHS threat to his life and
37500 unwilling to talk about
37600 anything else but the
37700 threat the Hell's
37800 Angels pose.
37900 {RESPONSE 4 PATIENT 4}
38000
38100 {16} Dr. - WHAT DO YOU MEAN THEY
38200 SENT THE HELL'S ANGELS UP
38300 HERE?
38400 Pt. - FROM DIFFERENT SECTORS OF Vagueness a bit alarming.
38500 MOTORCYCLE GANGS. A LOT OF
38600 THEM ARE SERVICE CONNECTED
38700
38800 {RESPONSE 6 PATIENT 4}
38900
39000 {17} Dr. - YOU MEAN YOU FEEL IN
39100 DANGER EVEN HERE IN THE
39200 HOSPITAL?
39300 Pt. - YES If it's true what has gone
39400 on before, it must be a
39500 frightening situation
39600 for the patient.
39700 {RESPONSE 6 PATIENT 4}
39800
39900 {18} Dr. - THAT MUST REALLY BE AWFUL,
40000 HERE YOU COME TO THE HOSPITAL
40100 FOR HELP AND YOU ARE STILL FRIGHTENED
40200 Pt. - I CAN'T EVEN TRUST MY
40300 DOCTOR ANYMORE Some heightening of feeling
40400 this man is paranoid.
40500 Genuinely frightened and
40600 {RESPONSE 6 PATIENT 5} can count on no one for
40700 help.
40800
40900 {19} Dr. - WHAT WOULD BE OF HELP
41000 TO YOU NOW?
41100 Pt. - I'M GETTING SLEEPY NOW Patient has organized a
41200 AND HAVE TO GO TO BED. system of people who will
41300 THE NUMBER ONE GUY ALWAYS harm him. The remark about
41400 letters in the last name
41500 HAS FOUR AND FIVE LETTERS gives some credence to
41600 IN HIS LAST NAME idea this is delusional
41700 system.
41800
41900 {RESPONSE 7 PATIENT 7}
42000
42100 EXCERPT Nr. 4:
42200 This excerpt is from an interview by a different judge with the same
42300 patient as in excerpt 3.
42400 I-O Pairs Comments By Judge Nr. 4
42500 {1} Dr. - COULD YOU TELL ME A
42600 LITTLE BIT ABOUT YOUR
42700 FEAR OF DYING AND IS THIS
42800 SOMETHING YOU HAVE FELT
42900 IN THE PAST?
43000 Pt. - I HAVE FELT THIS FOR AL- He responded appropriately.
43100 MOST A YEAR Response is concrete. He
43200 doesn't tell me much
43300 {RESPONSE 7 PATIENT 9} about his fear of dying.
43400 Somewhat guarded response.
43500
43600 {2} Dr. - COULD YOU TELL ME A
43700 LITTLE MORE ABOUT YOUR
43800 FEAR OF DYING AND HAVE
43900 YOU ANY THOUGHTS ABOUT
44000 HOW IT IS GOING TO HAPPEN
44100 AND WHETHER ANYONE WANTS
44200 TO HURT YOU?
44300 Pt. - I WILL END UP IN THE SEA Somewhat strange response.
44400 AND THE HELL'S ANGELS WILL Preoccupation with dying.
44500 DO IT Hell's Angels symbolic,
44600 dangerous people in the
44700 culture. I doubt if he has
44800 {RESPONSE 9 PATIENT 9} any contact with the Hell's
44900 Angels.
45000
45100 {3} Dr. - COULD YOU TELL ME A
45200 LITTLE BIT ABOUT THE KINDS
45300 OF CONTACT YOU HAVE HAD
45400 WITH THE HELL'S ANGELS IN
45500 THE PAST?
45600 Pt. - I HAVE KNOWN SOME OF THEIR Answer hard to evaluate. He
45700 DEALERS AND PUSHERS may be telling the truth,
45800 it may be his fantasy. Maybe
45900 guy is in for drug addiction.
46000 {RESPONSE 6 PATIENT 9} Somewhat concrete, guarded,
46100 and frightened.
46200
46300 {4} Dr. - COULD YOU SAY A LITTLE
46400 MORE ABOUT THE CIRCUMSTANCES
46500 IN WHICH YOU HAVE KNOWN SOME
46600 OF THEIR DEALERS AND PUSHERS?
46700 Pt. - THEY WERE MEMBERS OF MY It doesn't really answer the
46800 COMMUNITY WHEN I GOT OUT question, a little on a tan-
46900 OF THE SERVICE THEY HAD gent unconnected to the
47000 BEEN MY FRIENDS FOR SO LONG information I am asking. Does
47100 not tell me very much. Again
47200 guarded response.
47300 {RESPONSE 6 PATIENT 8}
47400
47500 {5} Dr. - DID YOU DEAL WITH THEM
47600 YOURSELF AND HAVE YOU
47700 BEEN ON DRUGS OR NAR-
47800 COTICS EITHER NOW OR
47900 IN THE PAST?
48000 Pt. - YES I HAVE IN THE PAST To differentiate him from
48100 BEEN ON MARIHUANA REDS previous patient, at least
48200 BENNIES LSD there is a certain amount
48300 of appropriateness to the
48400 answer although it doesn't
48500 tell me much about what I
48600 {RESPONSE 3 PATIENT 7} asked at least it's not
48700 bizarre. If I had him in my
48800 office I would feel con-
48900 fident I could get more
49000 information if I didn't
49100 have to go through the
49200 teletype. He's a little more
49300 willing to talk than the
49400 previous person. Answer
49500 to the question is fairly
49600 appropriate though not
49700 extensive. Much less of a
49800 flavor of paranoia than
49900 any of previous responses.
50000
50100 {6} Dr. - COULD YOU TELL ME HOW
50200 LONG YOU HAVE BEEN IN THE
50300 HOSPITAL AND SOMETHING
50400 ABOUT THE CIRCUMSTANCES
50500 THAT BROUGHT YOU HERE?
50600 Pt. - CLOSE TO A YEAR AND Response somewhat appropriate
50700 PARANOIA BROUGHT ME but doesn't tell me much.
50800 HERE The fact that he uses the
50900 word paranoia in the way
51000 that he does without
51100 {RESPONSE 5 PATIENT 7} any other information, indicates
51200 maybe it's a label he picked
51300 up on the ward or from his
51400 doctor.
51500 Lack of any kind of under-
51600 standing about himself.
51700 Dearth, lack of information.
51800 He's in some remission. Seems
51900 somewhat like a put-on. Seems
52000 he was paranoid and is in
52100 some remission at this time.
52200
52300 {7} Dr. - COULD YOU SAY SOMETHING
52400 NOW ABOUT YOUR PARANOID
52500 FEELINGS BOTH AT THE
52600 TIME OF ADMISSION AND
52700 DO YOU HAVE SIMILAR FEELINGS
52800 NOW AND IF SO HOW DO THEY
52900 AFFECT YOU?
53000 Pt. - AT THE TIME OF ADMISSION This response moves paranoia back
53100 I THOUGHT THE MAFIA WAS up. Stretching reality somewhat to
53200 AFTER ME AND NOW ITS THE think Hell's Angels are still in-
53300 HELL'S ANGELS terested in him. Somewhat bizarre
53400 in terms of content. Quite paranoid.
53500 {RESPONSE 8 PATIENT 9} Still paranoid. Gross and primitive
53600 responses. In middle of interview I
53700 felt patient was in touch but now
53800 responses have more concrete aspect.
53900
54000 {8} Dr. - DO YOU HAVE ANY THOUGHT
54100 AS TO WHY THESE TWO
54200 GROUPS WERE AFTER YOU?
54300 Pt. - BECAUSE I STOPPED SOME Response seems far fetched and hard
54400 OF THEIR DRUG SUPPLY to believe unless he was a narcotic
54500 agent which I doubt. Sounds some-
54600 {RESPONSE 9 PATIENT 9} what grandiose, magical, paranoid
54700 flavor, in general indicates he's
54800 psychotic, paranoid schizophrenic
54900 with delusions about these two
55000 groups and I wouldn't rule out
55100 some hallucinations as well. Ap-
55200 propriateness of response answers
55300 question in concrete but unbe-
55400 lievable way.
55500
55600
55700 The protocol judges were selected from the 1970 American
55800 Psychiatric Association Directory using a table of random numbers to
55900 select 105 names randomly. The protocol judges in this group were
56000 not informed that a computer was involved. Each of the 105
56100 psychiatrists was sent transcripts of three interviews along with a
56200 cover letter requesting participation in the experiment. The
56300 interview transcripts consisted of:
56400 1) An interview conducted by one of the eight judges with the
56500 paranoid model,
56600 2) An interview conducted by the same interview judge with a
56700 human paranoid patient, and
56800 3) An interview conducted by an independent psychiatrist of a
56900 human patient who was not clinically paranoid.
57000
57100 The 105 names were divided into eight groups, each member of
57200 which received transcripts of two interviews performed by one of the
57300 eight interview judges. The transcripts were printed so that after
57400 each input-output pair there were two lines of rating numbers such
57500 that the protocol judges could circle numbers corresponding to their
57600 ratings of both the previous responses of the patient, and an overall
57700 evaluation of the patient with regard to the paranoid continuum.
57800 Thirty-three protocol judges (a good response rate for psychiatric
57900 questionnaires) returned the rated protocols properly filled out and
58000 all were used in our data.
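The selection and grouping of protocol judges described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure actually used: a seeded pseudo-random generator stands in for the table of random numbers, the directory is a placeholder list of invented names, and the round-robin split into eight groups is an assumption, since the text does not say how the 105 names were apportioned. The function name select_protocol_judges is hypothetical.

```python
import random


def select_protocol_judges(directory, sample_size=105, n_groups=8, seed=None):
    """Draw names without replacement and deal them into groups.

    Round-robin dealing is an assumption; the source gives no rule.
    """
    rng = random.Random(seed)
    chosen = rng.sample(directory, sample_size)   # sampling without replacement
    return [chosen[i::n_groups] for i in range(n_groups)]


# Placeholder directory; the real one was the 1970 APA Directory.
directory = [f"psychiatrist_{i:05d}" for i in range(20000)]
groups = select_protocol_judges(directory, seed=1970)
print([len(g) for g in groups])
```

With 33 of the 105 protocols returned, the response rate works out to about 31%.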
58100
58200 The interviews with nonparanoid patients were included to
58300 control for the hypothesis that any teletyped interview with a
58400 patient might be judged "paranoid". Since virtually all of the
58500 ratings of the nonparanoid interviews were 0 for paranoia, the
58600 hypothesis was falsified.
58700
58800
58900 RESULTS
59000 The first index of resemblance examined was the simple one
59100 defined by the final overall rating given the patient and the model:
59200 which was rated as being more paranoid, the patient, the model, or
59300 neither? (See Table 1.) The protocol judges were more likely than
59400 the interview judges to distinguish the overall paranoid level of the model and the patient.
59500 In 37.5% of the paired interviews, the interview judges gave tied
59600 scores to the model and the patient as contrasted to only 9% of the
59700 protocol judges. Of the 35 non-tied paired ratings 15 rated the
59800 model as more paranoid. If p is the theoretical probability of a
59900 judge judging the model more paranoid than a human paranoid patient,
60000 we find the 95% confidence interval for p to be .27 to .59. Since
60100 p=.5 indicates indistinguishability of model and patient overall
60200 ratings and the interval around our observed p=.43 includes .5, the
60300 results support the claim that the model is a good simulation of a paranoid patient.
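The interval quoted above can be approximated from the reported counts (15 of 35 non-tied ratings). The sketch below assumes the usual normal-approximation interval for a binomial proportion; the text does not state which method was used, and the function name wald_interval is hypothetical.

```python
import math


def wald_interval(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = successes / trials
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat, p_hat - half_width, p_hat + half_width


# 15 of the 35 non-tied paired ratings placed the model above the patient.
p_hat, low, high = wald_interval(15, 35)
print(f"p = {p_hat:.2f}, 95% interval = {low:.2f} to {high:.2f}")
```

Under this assumption the limits agree with the reported .27 and .59 to within rounding, and the interval contains .5.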
60400
60500 Separate analysis of the strong and weak versions of the paranoid
60600 model indicates that indeed the strong model is judged more paranoid
60700 than the patients, the weak version less paranoid. Thus a change in
60800 the parameter structure of the paranoid model produces a change along
60900 the dimension of paranoid behavior in the expected direction.
61000
61100 TABLE 1
61200 Relative final overall ratings of paranoid model vs. paranoid patient
61300 indicating which was given highest overall rating of paranoia at end
61400 of interview.
61500 INSERT TABLE 1
61600
61700
61800
61900
62000
62100
62200
62300
62400 END OF TABLE 1
62500
62600 The second index of resemblance is a more sensitive measure based on
62700 the two series of response ratings in the paired interviews. The
62800 statistic used is basically the standardized Mann-Whitney statistic
62900 [Siegel].
63000 INSERT EQUATION
63100
63200 where R is the sum of the ranks of the response ratings in the series
63300 of ratings given to the model, n the number of responses given by the
63400 model, m the number of responses given by the patient. If the
63500 ratings given by a judge are randomly allocated to model and patient,
63600 i.e. model and patient are indistinguishable in response ratings, the
63700 expected value of Z is 0, with unit standard deviation. If higher
63800 ratings are more likely to be assigned to the model, Z is positive
63900 and, conversely, negative values of Z indicate greater likelihood of
64000 assigning higher ratings to the patient. Each judge in evaluating a
64100 pair of interviews generates a single value of Z.
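For readers who want to compute such a value, the sketch below uses the textbook form of the standardized rank-sum statistic, Z = (R - n(n+m+1)/2) / sqrt(nm(n+m+1)/12), with R, n and m as defined above. This form is an assumption: the text says only that the statistic is "basically" the standardized Mann-Whitney statistic, and no correction for tied ratings is applied here. Tied ratings receive the mean of the ranks they span. The function names and the toy ratings are hypothetical and are not data from the study.

```python
import math


def midranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1   # mean of 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def standardized_rank_sum(model_ratings, patient_ratings):
    """Z for one pair of interviews; positive when the model is rated higher."""
    n, m = len(model_ratings), len(patient_ratings)
    ranks = midranks(list(model_ratings) + list(patient_ratings))
    R = sum(ranks[:n])                      # rank sum of the model's responses
    expected = n * (n + m + 1) / 2
    variance = n * m * (n + m + 1) / 12     # no tie correction (assumption)
    return (R - expected) / math.sqrt(variance)


# Toy ratings for illustration only.
print(standardized_rank_sum([5, 6, 7], [1, 2, 3]))
```

Swapping the two arguments changes the sign of Z, and identical rating series give Z = 0.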
64200
64300 The overall mean of the Z scores was +.044 with the standard
64400 deviation 1.68 (df=40). Thus the overall 95% confidence interval for
64500 the asymptotic mean value of Z is -.485 to +.573. The range of Z values
64600 is -3.8 to +4.46. The length of the confidence interval is a result
64700 of the large variance which itself is mainly related to the contrast
64800 between the weak and strong versions. (See TABLES 2 and 3). Once
64900 again the strong version of the model is more paranoid than the
65000 patients, the weak version less paranoid.
65100
65200 TABLE 2
65300 Summary statistics of Z ratings by group
65400 In this design eight psychiatrists interviewed by teletype
65500 INSERT TABLE 2
65600
65700
65800
65900
66000
66100
66200
66300
66400
66500 END OF TABLE 2
66600 All judges (both interview and protocol) who evaluated the same pair
66700 of interviews are referred to as a "group". Strong groups evaluated
66800 strong versions of the paranoid model, while weak groups evaluated
66900 weak versions of the model.
67000
67100 It is not surprising that results using the two indices of
67200 resemblance are parallel, since the indices are highly interrelated.
67300 The mean Z value for the 15 interviews on which the model was rated
67400 more paranoid was +1.28, on the 6 where model and patient tied: +.41,
67500 on the 20 in which the patient was more paranoid: -.993. A positive
67600 value of Z was observed when the patient was given an overall rating
67700 greater than the model 6 times; a negative value of Z when the model
67800 was rated more paranoid twice.
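The three subgroup means just quoted can be pooled as a cross-check on the overall mean and confidence interval reported earlier in this section. The sketch is an illustration rather than the authors' computation. It assumes a Student's t interval for the mean, takes 2.021 as the standard two-sided 95% point for 40 degrees of freedom, and uses hypothetical variable names.

```python
import math

# (number of paired interviews, mean Z) for each outcome reported above
groups = [(15, 1.28), (6, 0.41), (20, -0.993)]

n_total = sum(n for n, _ in groups)
pooled_mean = sum(n * mean for n, mean in groups) / n_total

sd = 1.68        # reported standard deviation of the Z scores
t_crit = 2.021   # two-sided 95% point of Student's t, 40 df (standard tables)
half_width = t_crit * sd / math.sqrt(n_total)

print(f"pooled mean Z = {pooled_mean:+.3f}")
print(f"95% interval  = {pooled_mean - half_width:+.3f} to {pooled_mean + half_width:+.3f}")
```

The pooled mean comes out slightly positive, near +.044, which is the midpoint of the reported interval of -.485 to +.573.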
67900
68000 TABLE 3
68100 Analysis of Variance of Z Ratings
68200 INSERT TABLE 3
68300
68400
68500
68600
68700
68800
68900
69000
69100
69200 END OF TABLE 3
69300
69400 level of guessing.
69500
69600
69700 DISCUSSION
69800 The results of this experiment indicate our simulation of
69900 paranoid processes to be successful relative to the
70000 indistinguishability tests utilized. Thus it is an acceptable
70100 simulation as measured by the standard proposed.
70200
70300 It is worth emphasizing that our test invited refutation of
70400 the model. The experimental design of the tests put the model in
70500 jeopardy of falsification. If the paranoid model did not survive
70600 these tests, i.e. if it were not considered paranoid by expert
70700 judges, if there were no correlation between the weak-strong versions
70800 of the model and the severity ratings of the judges, and if they
70900 could distinguish actual patient interviews from
71000 computer program interviews, then no claim regarding the success of
71100 the simulation could be made. Survival of a falsification procedure
71200 constitutes a validating step.
71300
71400 It is of some historical significance that these experiments
71500 were conducted at all. To my knowledge no one to date has subjected
71600 his model of human mental processes to such challenging
71700 indistinguishability tests. Other competing models are needed in the
71800 field of psychopathology. These tests set a precedent and provide a
71900 standard for competing models to be measured against. The general
72000 area of computer simulation of mental processes needs not only better
72100 models but better tests and statistical measures of resemblance. The
72200 problems of appropriate critical experimental designs and measures
72300 provide a promising frontier for future work.
72400 6.2 THE MACHINE QUESTION
72500 As mentioned (p. 00), we conducted an experiment on the machine
72600 out of curiosity. For hundreds of years humans have wondered how
72700 to distinguish a man from an imitation. To distinguish a man from
72800 a statue Galileo suggested tickling each with a feather. To distinguish
72900 a man from a machine Descartes suggested linguistic tests. Turing's
73000 proposals have been discussed on p. 00.
73100 To ask the machine-question, we sent interview transcripts,
73200 one with a patient and one with PARRY, to 100 psychiatrists randomly
73300 selected from the Directory of American Specialists and the Directory
73400 of the American Psychiatric Association. Of the 41 replies, 21 (51%)
73500 made the correct identification while 20 (49%) were wrong. Based on
73600 this random sample of 41 psychiatrists, the 95% confidence interval
73700 is 35.9% to 66.5%, a range which includes the chance level of 50%.
73800 Psychiatrists are considered expert judges of patient
73900 interview behavior but they are unfamiliar with computers. Hence we
74000 conducted the same test with 100 computer scientists randomly
74100 selected from the membership list of the Association for Computing
74200 Machinery, ACM. Of the 67 replies, 32 (48%) were right and 35 (52%)
74300 were wrong. Based on this random sample of 67 computer scientists, the
74400 95% confidence interval is 36% to 60%, again including the chance level.
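The two confidence intervals quoted above can be reproduced with the ordinary normal-approximation (Wald) interval for a proportion. The following Python sketch is an illustration only: the text does not say which formula was used, and the function name `wald_interval` and the multiplier 1.96 are assumptions, chosen because they recover the reported figures from the reported counts.

```python
import math

def wald_interval(correct, n, z=1.96):
    """Normal-approximation interval for a proportion, returned in percent."""
    p = correct / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)

# Counts taken from the text: (correct identifications, replies received).
samples = {"psychiatrists": (21, 41), "computer scientists": (32, 67)}

for group, (correct, n) in samples.items():
    low, high = wald_interval(correct, n)
    # A chance-level result is one whose interval contains 50%.
    includes_chance = low <= 50.0 <= high
    print(f"{group}: {low:.1f}% to {high:.1f}% (includes 50%: {includes_chance})")
```

With samples of this size the interval is wide (roughly plus or minus 12 to 15 percentage points), so failing to exclude 50% is weak evidence by itself; that limitation is taken up in the discussion that follows.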
74500 Thus the answer to this machine-question "can expert judges,
74600 psychiatrists and computer scientists, using teletyped transcripts
74700 of psychiatric interviews, distinguish between paranoid patients and
74800 a simulation of paranoid processes?" is "No". Turing predicted in
74900 1950: "I believe that in about fifty years' time it will be possible
75000 to programme computers, with a storage capacity of about 10^9, to
75100 make them play the imitation game so well that an average
75200 interrogator will not have more than 70 percent chance of making the
75300 right identification after five minutes of questioning." In 1972, 22
75400 years after Turing's prediction and allowing interviewers 20-40 I/O
75500 pairs (a better measure than real time), our model played a version
75600 of the imitation game well enough that expert judges had only a 50
75700 percent chance of making the right identification.
75800 But what do we learn from asking the machine question and
75900 finding that the answer is "no"? It is some comfort that the answer
76000 was not "yes" and the null hypothesis (no differences) failed to be
76100 rejected, especially since statistical tests are somewhat biased in
76200 favor of rejecting the null hypothesis (Meehl, 1967). Yet this answer
76300 does not tell us what we would most like to know, i.e. how to
76400 improve the model. Simulation models do not spring forth in a
76500 complete, perfect and final form; they must be gradually developed
76600 over time. Perhaps we might obtain a "yes" answer to the
76700 machine-question if we allowed a large number of expert judges to
76800 conduct the interviews themselves rather than studying transcripts of
76900 other interviewers. A "yes" would indicate that the model must be
77000 improved but unless we systematically investigated how the judges
77100 succeeded in making the discrimination we would not know what aspects
77200 of the model to work on. The logistics of such a design are immense
77300 and obtaining a large N of judges for sound statistical inference
77400 would require an effort disproportionate to the information-yield.
77500 6.3 MULTIDIMENSIONAL EVALUATION
77600 A more efficient and informative way to use Turing-like tests
77700 is to ask judges to make ordinal ratings along scaled dimensions from
77800 teletyped interviews. We shall term this approach asking the
77900 dimension-question. One can then compare scaled ratings received by
78000 the patients and by the model to precisely determine where and by how
78100 much they differ. Model builders strive for a model which
78200 shows indistinguishability along some dimensions and
78300 distinguishability along others. That is, the model converges on what
78400 it is supposed to simulate and diverges from that which it is not.
78500 We mailed paired-interview transcripts to another 400
78600 randomly selected psychiatrists asking them to rate the responses of
78700 the two `patients' along certain dimensions. The judges were divided
78800 into groups, each judge being asked to rate responses of each I/O
78900 pair in the interviews along four dimensions. The total number of
79000 dimensions in this test was twelve: linguistic noncomprehension,
79100 thought disorder, organic brain syndrome, bizarreness, anger, fear,
79200 ideas of reference, delusions, mistrust, depression, suspiciousness
79300 and mania. These are dimensions which psychiatrists commonly use in
79400 evaluating patients.
79500 (INSERT TABLE 4 HERE)
79600 Table 4 shows there were significant differences, with PARRY
79700 receiving higher scores along the dimensions of linguistic
79800 noncomprehension, thought disorder, bizarreness, anger, mistrust and
79900 suspiciousness. On the dimension of delusions the patients were rated
80000 significantly higher. There were no significant differences along the
80100 dimensions of organic brain syndrome, fear, ideas of reference,
80200 depression and mania.
80300 Whereas tests asking the machine-question indicate
80400 indistinguishability at the gross level, a study of the finer or
80500 micro-structure of the model's behavior through ratings along scaled
80600 dimensions shows statistically significant differences between
80700 patients and model. These differences are of help to the model
80800 builder in suggesting which aspects of the model must be modified and
80900 improved in order to be considered an adequate simulation of the
81000 class of paranoid patients it is intended to simulate. It is evident
81100 from the graph of Fig.000 that no modifications are necessary along
81200 the dimension of "organic brain syndrome". But it is clear that the
81300 model's language-comprehension must be improved. Once this has been
81400 implemented, a future test will tell us whether improvement has
81500 occurred and by how much in comparison to the earlier version.
81600 Successive identification of particular areas of failure provides a
81700 type of sensitivity analysis which makes clear what improvements must
81800 be pursued in developing more adequate model versions.
81900 Further evidence that the machine-question is too coarse and
82000 insensitive a test comes from the following experiment. In this test
82100 we constructed a random version of the paranoid model which utilized
82200 PARRY's output statements but expressed them randomly no matter what
82300 the interviewer said. Two psychiatrists conducted interviews with
82400 this model, transcripts of which were paired with patient interviews
82500 and sent to 200 randomly selected psychiatrists asking both the
82600 machine-question and the dimension-question. Of the 69 replies, 34
82700 (49%) were right and 35 (51%) wrong. Based on this random sample of
82800 69 psychiatrists, the 95% confidence interval ranges from 39% to 63%,
82900 again indicating a chance level. When a poor model, such as a random
83000 one, passes a test, it suggests the test is weak. In the case where
83100 even a random model cannot be distinguished, we can conclude that asking
83200 the simple machine-question does not offer a severe enough challenge.
83300 Although a distinction is not made when "which is the machine?" is
83400 asked, definite distinctions ARE made when judgements are requested
83500 along specific dimensions. As shown in Table 5, significant
83600 differences appear along the dimensions of linguistic
83700 noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
83800 rated higher. On these particular dimensions we can construct a
83900 continuum in which the random version represents one extreme, the
84000 actual patients another. Our (nonrandom) PARRY lies somewhere between
84100 these two extremes, indicating that it performs significantly better
84200 than the random version but still requires improvement before being
84300 indistinguishable from patients (see Fig. 1). Table 6 presents t
84400 values for differences between mean ratings of PARRY and
84500 RANDOM-PARRY (see Table 5 and Fig. 1 for the mean ratings).
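For readers who want to see how a t value for a difference between two mean ratings is formed, a minimal Python sketch follows. It is an assumption-laden illustration, not the analysis behind Table 6: the text does not say which form of the t test was used, so Welch's unequal-variance form is substituted here, the function name `welch_t` is hypothetical, and the ratings are invented solely to exercise the arithmetic.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    # variance() is the sample variance (n - 1 in the denominator).
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical ratings on a single dimension; NOT data from the study.
parry_ratings = [3, 4, 2, 3, 4, 3]
random_parry_ratings = [6, 7, 5, 6, 7, 8]

t = welch_t(parry_ratings, random_parry_ratings)
print(f"t = {t:.2f}")
```

A negative t here simply means the first group's mean rating is lower than the second's; computing one such value per dimension is what allows the two model versions to be placed on a continuum.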
84600 Thus it can be seen that such a multidimensional evaluation
84700 provides yardsticks for measuring the adequacy of this or any other
84800 dialogue simulation model along the relevant dimensions.
84900 We conclude that when model builders want to conduct tests of
85000 adequacy which indicate in which direction progress lies and to
85100 obtain a measure of whether progress is being achieved, the way to
85200 use Turing-like tests is to ask expert judges to make ratings along
85300 multiple dimensions that are essential to the model. A good
85400 validation procedure has criteria for better or worse approximations.
85500 Useful tests do not prove a model; they probe it for its strengths
85600 and weaknesses and clarify what is to be done next in modifying and
85700 repairing the model. Simply asking the machine-question yields little
85800 information relevant to what the model builder most wants to know,
85900 namely, along what dimensions must the model be repaired and improved.
86000
86100